Abstract

Representation learning is currently a very active topic in machine learning, mostly due to the great success of deep learning methods. In particular, a low-dimensional representation which discriminates classes can not only enhance the classification procedure but also make it faster and, contrary to high-dimensional embeddings, can be used efficiently for visual exploratory data analysis.

In this paper we propose Maximum Entropy Linear Manifold (MELM), a multidimensional generalization of the Multithreshold Entropy Linear Classifier model, which is able to find a low-dimensional linear data projection maximizing the discriminativeness of the projected classes. As a result we obtain a linear embedding which can be used for classification, class-aware dimensionality reduction and data visualization. MELM provides highly discriminative 2D projections of the data which can be used as a method for constructing robust classifiers.

We provide both an empirical evaluation and some interesting theoretical properties of our objective function, such as scale and affine transformation invariance, connections with PCA and bounding of the expected balanced accuracy error.


1 Introduction 

Correct representation of the data, consistent with the problem and the classification method used, is crucial for the efficiency of machine learning models. In practice it is a very hard task to find a suitable embedding of many real-life objects in the $\mathbb{R}^d$ space used by most of the algorithms. In particular, for natural language processing, cheminformatics [16] or even image recognition tasks it is still an open problem. As a result there is a growing interest in methods of representation learning [8], suited for finding a better embedding of our data, which may be further used for classification, clustering or other analysis purposes. Recent years brought many success stories, such as dictionary learning [12] or deep learning [9]. Many of them look for a sparse [7], high-dimensional embedding which simplifies linear separation at the cost of making visual analysis nontrivial. A dual approach is to look for a low-dimensional linear embedding, which has the advantage of easy visualization, interpretation and manipulation at the cost of a much weaker (in terms of model complexity) space of transformations.

In this work we focus on the scenario where we are given a labeled dataset in $\mathbb{R}^d$ and we are looking for a low-dimensional linear embedding which allows us to easily distinguish each of the classes. In other words we are looking for a highly discriminative, low-dimensional representation of the given data.

Our basic idea follows from the observation that density estimation is credible only in low-dimensional spaces. Consequently, we first project the data onto an arbitrary $k$-dimensional affine submanifold $\mathcal{V}$ (where $k$ is fixed), and search for the $\mathcal{V}$ for which the estimated densities of the projected classes are orthogonal to each other, where the Cauchy-Schwarz Divergence is applied as a measure of discriminativeness of the projection; see Fig. 1 for an example of such a projection preserving the separation of classes. The work presented in this paper is a natural extension of our earlier results [6], where we considered the one-dimensional case. However, we would like to underline that the approach used needed a nontrivial modification. In the one-dimensional case we could identify subspaces with elements of the unit sphere in a natural way. For higher dimensional subspaces such an identification is no longer possible.

Figure 1: Visualization of the sonar dataset using Maximum Entropy Linear Manifold with $k = 2$.

To the authors' best knowledge the presented idea is novel, and has not been considered earlier as a method of classification and data visualization. One of its benefits is that it does not depend on affine rescaling of the data, which is a rare feature among common classification tools. What is also interesting, we show that as its simple limiting one-class case we obtain the classical PCA projection. Moreover, from the theoretical standpoint the Cauchy-Schwarz divergence can be decomposed into a fitting term, bounding the expected balanced misclassification error, and a regularizing term, simplifying the resulting model. We compute its value and derivative so one can use first-order optimization to find a solution even though the true optimization should be performed on a Stiefel manifold. Empirical tests show that such a method not only in some cases improves the classification score over learning from raw data but, more importantly, consistently finds a highly discriminative representation which can be easily visualized. In particular, we show that the discriminativeness of the resulting projections is much higher than that of many popular linear methods, including the recently proposed GEM model. For the sake of completeness we also include the full source code of the proposed method in the supplementary material.


2 General idea 

In order to visualize a dataset in $\mathbb{R}^d$ we need to project it onto $\mathbb{R}^k$ for very small $k$ (typically 2 or 3). One can use either a linear transformation or some complex embedding; however, choosing the second option in general makes the results hard to interpret. Linear projections have the tempting characteristics of being both easy to understand (from both the theoretical perspective and the practical implications of the obtained results) and highly robust in further applications of the transformation.



Figure 2: Visualization of the MELM idea. For a given dataset $X_-, X_+$ we search through various linear projections $V$ and analyze how divergent their density estimations are in order to select the most discriminative one.

In this work we focus on this class of projections, so in practice we are looking for some matrix $V \in \mathbb{R}^{d \times k}$ such that for a given dataset $X \in \mathbb{R}^{d \times N}$ the projection $V^T X$ preserves as much of the important information about $X$ as possible (sometimes under additional constraints). The choice of the definition of the information measure $IM$ together with the set of constraints $\varphi_i$ defines a particular reduction method:

$$\underset{V \in \mathbb{R}^{d \times k}}{\text{maximize}} \quad IM(V^T X; X, Y) \qquad \text{subject to} \quad \varphi_i(V), \; i = 1, \dots, m.$$

There are many transformations which can achieve such results. For example, the well known Principal Component Analysis defines important information as data scattering, so it looks for $V$ which preserves as much of the variance of $X$ as possible and requires $V$ to be orthogonal. In the information bottleneck method one defines this measure as the amount of mutual information between $X$ and some additional $Y$ (such as a set of labels) which has to be preserved. A similar approach is adopted in the recently proposed Generalized Eigenvectors for Discriminative Features (GEM), where one tries to preserve the signal to noise ratio between samples from different classes. In the case of Maximum Entropy Linear Manifold (MELM), introduced in this paper, important information is defined as the discriminativeness of the samples from different classes, with orthonormal $V$. In other words we work with labeled samples (in general, binary labeled) and wish to preserve the ability to distinguish one class ($X_-$) from another ($X_+$). In more formal terms, our optimization problem is to

$$\underset{V \in \mathbb{R}^{d \times k}}{\text{maximize}} \quad D_{CS}([V^T X_-], [V^T X_+]) \qquad \text{subject to} \quad V^T V = I,$$

where $D_{CS}(\cdot, \cdot)$ denotes the Cauchy-Schwarz Divergence, a measure of how divergent the given probability distributions are, and $[\cdot]$ denotes some density estimator which, given samples, returns a probability distribution. The general idea is also visualized in Fig. 2.


3 Theory 

We first discuss the one-class case, which has a mainly introductory character as it shows a simplified version of our main idea.

Suppose that we have unlabeled data $X \subset \mathbb{R}^d$ and that we want to reduce the dimension of the data (for example to visualize it, reduce outliers, etc.) to $k < d$. One of the possible approaches is to use information theory and search for a $k$-dimensional subspace $\mathcal{V} \subset \mathbb{R}^d$ for which the orthogonal projection of $X$ onto $\mathcal{V}$ preserves as much information about $X$ as possible.

One can clearly choose various measures of information. In our case, due to computational simplicity, we have decided to use Renyi’s quadratic entropy, which for a density $f$ on $\mathbb{R}^k$ is given by

$$H_2(f) = -\log \int_{\mathbb{R}^k} f^2(x)\,dx.$$

One can equivalently use the information potential, which is given as the squared $L^2$ norm of the density, $ip(f) = \int_{\mathbb{R}^k} f^2(x)\,dx$. We need the easy observation that one can compute Renyi’s quadratic entropy for the normal density $\mathcal{N}(m, \Sigma)$ in $\mathbb{R}^k$ [4]:

$$H_2(\mathcal{N}(m, \Sigma)) = \frac{k}{2}\log(4\pi) + \frac{1}{2}\log(\det \Sigma). \tag{1}$$
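
As a quick sanity check of (1), one can compare the closed form with a direct numerical integration. The following is only a minimal sketch for the one-dimensional case; the function name `renyi_h2_gaussian`, the variance value and the integration grid are illustrative assumptions, not part of the paper's code.

```python
import numpy as np

def renyi_h2_gaussian(cov):
    """Closed form (1): H2(N(m, cov)) = k/2 * log(4*pi) + 1/2 * log(det(cov))."""
    cov = np.atleast_2d(cov)
    k = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * k * np.log(4.0 * np.pi) + 0.5 * logdet

# Numerical check in one dimension: H2(f) = -log(integral of f^2).
sigma2 = 2.5                                  # illustrative variance
x = np.linspace(-40.0, 40.0, 400001)          # integration grid (assumption)
dx = x[1] - x[0]
f = np.exp(-0.5 * x**2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)
h2_numeric = -np.log(np.sum(f**2) * dx)       # Riemann sum of f^2
h2_closed = renyi_h2_gaussian(sigma2)
print(h2_numeric, h2_closed)
```

The mean does not enter the formula, which is why the check uses a centred density.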

However, in order to compute Renyi’s quadratic entropy of discrete data we first need to apply some density estimation technique. By joining all the above mentioned steps together we are able to pose the basic optimization problem we are interested in.

Optimization problem 1. Suppose that we are given data $X \subset \mathbb{R}^d$ and $k$ which denotes the dimension after reduction. Find the orthonormal base $V$ of the $k$-dimensional subspace $\mathcal{V}$ for which the value of

$$H_2([V^T X])$$

is maximal, where $[\cdot]$ denotes a given fixed method of density estimation.

If we have data $X$ with mean $m$ and covariance $\Sigma$ in $\mathbb{R}^d$ and $k$ orthonormal vectors $V = [V_1, \dots, V_k]$, then we can ask what the mean and covariance of the orthogonal projection of $X$ onto the space spanned by $V$ will be. It is easy to show that they are given by $V^T m$ and $V^T \Sigma V$. In other words, if we consider the data in the base given by an orthonormal extension of $V$ to the whole $\mathbb{R}^d$, the covariance of the projected data corresponds to the upper left $k \times k$ block submatrix of the original covariance.
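
The statement about the projected mean and covariance can be illustrated numerically. The sketch below assumes synthetic Gaussian data stored column-wise (as in $V^T X$) and a random orthonormal $V$ obtained from a QR decomposition; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 1000
# Synthetic data: columns are points, each coordinate has a different scale.
X = rng.normal(size=(d, n)) * np.arange(1, d + 1)[:, None]
V, _ = np.linalg.qr(rng.normal(size=(d, k)))   # k orthonormal columns

m = X.mean(axis=1)
Sigma = np.cov(X)        # d x d covariance of the data
P = V.T @ X              # coordinates of the projected points in the basis V

mean_ok = np.allclose(P.mean(axis=1), V.T @ m)
cov_ok = np.allclose(np.cov(P), V.T @ Sigma @ V)
print(mean_ok, cov_ok)
```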

We are going to show that if we apply the simplest density estimation of the underlying density for the projected data, given by the maximum likelihood estimator over the family of normal densities², then our optimization problem is equivalent to taking the first $k$ elements of the base given by PCA.

Theorem 1. Let $X \subset \mathbb{R}^d$ be a given dataset with mean $m$ and covariance $\Sigma$ and let $[\cdot]_{\mathcal{N}}$ denote the density estimation which returns the maximum likelihood estimator over Gaussian densities. Then

$$\max\{H_2([V^T X]_{\mathcal{N}}) : V \in \mathbb{R}^{d \times k}, V^T V = I\}$$

is realized for the first $k$ orthonormal vectors given by the PCA.

Proof. By the comments above and (1) we have

$$H_2([V^T X]_{\mathcal{N}}) = \frac{k}{2}\log(4\pi) + \frac{1}{2}\log(\det(V^T \Sigma V)).$$

In other words we search for those $V$ for which the value of $\det(V^T \Sigma V)$ is maximized. Now by the Cauchy interlacing theorem [2] the eigenvalues of $V^T \Sigma V$ (ordered decreasingly) are bounded above by the eigenvalues of $\Sigma$. Consequently, the maximum is obtained in the case when $V$ consists of the orthonormal eigenvectors of $\Sigma$ corresponding to the $k$ biggest eigenvalues of $\Sigma$. However, these are exactly the first $k$ elements of the orthonormal base constructed by the PCA. □

Using analogous reasoning we can also prove the dual result.

Theorem 2. For $X$, $m$, $\Sigma$, $[\cdot]_{\mathcal{N}}$ and $k$ as in the previous theorem

$$\min\{H_2([V^T X]_{\mathcal{N}}) : V \in \mathbb{R}^{d \times k}, V^T V = I\}$$

is realized for the last $k$ orthonormal vectors defined by the PCA.
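
Both theorems can be checked empirically. Since the entropy depends on $V$ only through $\log\det(V^T \Sigma V)$, the sketch below compares this quantity for the first and last $k$ eigenvectors with its values on randomly drawn orthonormal $V$. The covariance matrix is synthetic and the number of random trials is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)                # synthetic positive definite covariance

def log_det_projected(V):
    """log det(V^T Sigma V), the V-dependent part of the Gaussian entropy."""
    return np.linalg.slogdet(V.T @ Sigma @ V)[1]

eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
V_first = eigvecs[:, -k:]                  # first k principal directions
V_last = eigvecs[:, :k]                    # last k principal directions

random_vals = []
for _ in range(1000):
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # random orthonormal V
    random_vals.append(log_det_projected(Q))

print(log_det_projected(V_last), min(random_vals),
      max(random_vals), log_det_projected(V_first))
```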

1 We identify those vectors with the linear space spanned over them.

2 That is, for $A \subset \mathcal{V}$ we put $[A]_{\mathcal{N}} = \mathcal{N}(m_A, \mathrm{cov}_A) : \mathcal{V} \to \mathbb{R}_+$.

As a result we obtain the general intuition that maximization of Renyi’s quadratic entropy leads to the selection of highly spread-out data, while its minimization selects a projection in which the image is very condensed.

Let us now proceed to binary labeled data. Recall that $D_{CS}$ can be equivalently expressed in terms of Renyi’s quadratic entropy ($H_2$) and Renyi’s quadratic cross entropy ($H_2^\times$):

$$\begin{aligned} D_{CS}(V) &= \log \int [V^T X_+]^2 + \log \int [V^T X_-]^2 - 2 \log \int [V^T X_+][V^T X_-] \\ &= -H_2([V^T X_-]) - H_2([V^T X_+]) + 2 H_2^\times([V^T X_+], [V^T X_-]). \end{aligned}$$

Let us recall that our optimization aim is to find a sequence $V$ consisting of $k$ orthonormal vectors for which $D_{CS}(V)$ is maximized.

Observation 1. Assume that the density estimator $[\cdot]$ does not change under an affine change of the coordinate system³. One can show, by an easy modification of the theorem by Czarnecki and Tabor [6, Theorem 4.1], that the maximum of $D_{CS}(\cdot)$ is independent of an affine change of the data. Namely, for an arbitrary affine invertible map $M$ we have:

$$\begin{aligned} &\max\{D_{CS}(V; X_+, X_-) : V \text{ orthonormal}\} \\ &= \max\{D_{CS}(V; X_+, X_-) : V \text{ linearly independent}\} \\ &= \max\{D_{CS}(V; M X_+, M X_-) : V \text{ orthonormal}\}. \end{aligned}$$

The above feature, although typical in density estimation, is rather uncommon in modern classification tools.
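
The mechanism behind this observation can be illustrated in the simplest setting where no dimension reduction takes place ($k = d = 2$). The sketch below assumes a Gaussian kernel density estimator whose kernel covariance is proportional to the sample covariance, with the constant factor of the bandwidth omitted for brevity. The helper names `cross_ip` and `d_cs`, the synthetic samples and the map $M$ are illustrative assumptions.

```python
import numpy as np

def cross_ip(A, B):
    """Integral of the product of Gaussian KDEs of A and B (rows are points)."""
    k = A.shape[1]
    # Kernel covariances proportional to the sample covariances (assumption).
    S = (len(A) ** (-2.0 / (k + 4)) * np.atleast_2d(np.cov(A.T))
         + len(B) ** (-2.0 / (k + 4)) * np.atleast_2d(np.cov(B.T)))
    W = (A[:, None, :] - B[None, :, :]).reshape(-1, k)   # all differences a - b
    maha = np.einsum("ij,jk,ik->i", W, np.linalg.inv(S), W)
    norm = (2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(S)) * len(A) * len(B)
    return np.exp(-0.5 * maha).sum() / norm

def d_cs(A, B):
    """Cauchy-Schwarz divergence between the KDEs of A and B."""
    return (np.log(cross_ip(A, A)) + np.log(cross_ip(B, B))
            - 2 * np.log(cross_ip(A, B)))

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 2))
B = rng.normal(size=(50, 2)) + np.array([1.5, 0.0])
M = np.array([[3.0, 1.0], [-2.0, 0.5]])    # invertible linear part
t = np.array([10.0, -4.0])                 # translation

before = d_cs(A, B)
after = d_cs(A @ M.T + t, B @ M.T + t)     # same divergence after the affine map
print(before, after)
```

Each of the three integrals is rescaled by the same factor $1/|\det M|$, so the factors cancel in the divergence.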

Similarly to the one-dimensional case, when $V \in \mathbb{R}^d$, we can decompose the objective function into fitting and regularizing terms:

$$D_{CS}(V) = \underbrace{2 H_2^\times([V^T X_+], [V^T X_-])}_{\text{fitting term}} - \underbrace{\left(H_2([V^T X_-]) + H_2([V^T X_+])\right)}_{\text{regularizing term}}.$$

The regularizing term has a slightly different meaning than in most machine learning models. Here it controls the number of disjoint regions which appear after performing density-based classification in the projected space. For the one-dimensional case it is the number of thresholds in the multithreshold linear classifier, for $k = 2$ it is the number of disjoint curves defining the decision boundary, and so on. Renyi’s quadratic entropy is minimized when each class is as condensed as possible (as we show in Theorem 2), intuitively resulting in a small number of disjoint regions.

It is worth noting that, despite similarities, it is not the common classification objective which can be written as an additive loss function and a regularization term

$$L(V) = \sum_{i=1}^{N} \ell(V^T x_i, y_i, x_i) + \Omega(V),$$

as the error depends on the relations between each pair of points instead of each point independently. One can easily prove that there are no $\ell$, $\Omega$ for which $D_{CS}(V) = L(V; \ell, \Omega)$. Such a choice of the objective function might lead to the lack of connections with optimization of any reasonable accuracy related metric, as those are based on point-wise loss functions. However, it appears that $D_{CS}$ bounds the expected balanced accuracy⁴ similarly to how hinge loss bounds 0/1 loss. This can be formalized in the following way.

3 This happens in particular for the kernel density estimation we apply in the paper.

Theorem 3. The negative log-likelihood of balanced misclassification in a $k$-dimensional linear projection of any non-separable densities $f_\pm$ onto $\mathcal{V}$ is bounded from below by half of the Renyi’s quadratic cross entropy of these projections.

Proof. The likelihood of balanced misclassification over a $k$-dimensional hypercube after projection through $V$ equals

$$\int_{[0,1]^k} \min\{(V^T f_+)(x), (V^T f_-)(x)\}\,dx.$$

Using reasoning analogous to the one presented by Czarnecki [5], together with the Cauchy inequality and other basic inequalities, one can show that

$$-\log \int_{[0,1]^k} \min\{(V^T f_+)(x), (V^T f_-)(x)\}\,dx \ge \frac{1}{2} H_2^\times(V^T f_+, V^T f_-).$$

□
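
One possible way to see where such a bound comes from is sketched below. This is not necessarily the argument of [5]. It assumes that the projected densities are supported in the unit hypercube (which has volume one), and the shorthand $f = V^T f_+$, $g = V^T f_-$ is introduced here only for brevity. The pointwise inequality $\min\{a, b\} \le \sqrt{ab}$ for $a, b \ge 0$ is combined with the Cauchy-Schwarz inequality:

```latex
\int_{[0,1]^k} \min\{f, g\}\,dx
  \;\le\; \int_{[0,1]^k} \sqrt{f g}\cdot 1\,dx
  \;\le\; \left(\int_{[0,1]^k} f g\,dx\right)^{1/2}
          \left(\int_{[0,1]^k} 1\,dx\right)^{1/2}
  \;=\; \left(\int_{[0,1]^k} f g\,dx\right)^{1/2},
\qquad\text{hence}\qquad
-\log \int_{[0,1]^k} \min\{f, g\}\,dx
  \;\ge\; -\frac{1}{2}\log \int_{[0,1]^k} f g\,dx
  \;=\; \frac{1}{2} H_2^\times(f, g).
```

Non-separability of the densities guarantees that $\int f g > 0$, so the logarithm on the right-hand side is finite.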

As a result we might expect that maximizing $D_{CS}$ leads to the selection of a projection which on the one hand maximizes the balanced accuracy over the training set (minimizes empirical error) and on the other hand counteracts overfitting by minimizing the number of disjoint classification regions (minimizes model complexity).

4 Closed form solution for objective and its gradient

Let us now investigate more practical aspects of the proposed approach. We show the exact formulas for both $D_{CS}$ and its gradient as functions of finite, labeled samples (binary datasets) so one can easily plug them into any first-order optimization software.

Let $X_+, X_-$ be fixed subsets of $\mathbb{R}^d$. Let $\mathcal{V}$ denote the $k$-dimensional subspace generated by $V = [V_1, \dots, V_k] \in \mathbb{R}^{d \times k}$ (we consider only the case when the sequence $V$ is linearly independent). We project the sets $X_\pm$ orthogonally onto $\mathcal{V}$, and compute the Cauchy-Schwarz Divergence of the kernel density estimations (using Silverman’s rule) of the resulting projections:

$$G^{-1}(V)[V^T X_+] \quad \text{and} \quad G^{-1}(V)[V^T X_-],$$

where $G(V) = V^T V$ denotes the Gram matrix. We search for the $V$ for which the Cauchy-Schwarz Divergence is maximal. Recall that the scalar product in the space of matrices is given by $\langle V_1, V_2 \rangle = \mathrm{tr}(V_1^T V_2)$.

4 Accuracy with class priors being ignored, $BAC = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$.

There are basically two possible approaches one can apply: either search for the solution in the set of orthonormal $V$ which generate $\mathcal{V}$, or allow all $V$ with a penalty function. The first method is possible⁵ but does not allow the use of most of the existing numerical libraries as the space we work in is highly nonlinear. This is the reason why we use the second approach, which we describe below.

Since, as we have observed in the previous section, the result does not depend on an affine transformation of the data, we can restrict ourselves to the analogous formula for the sets

$$V^T X_+ \quad \text{and} \quad V^T X_-,$$

where $V$ consists of linearly independent vectors. Consequently, we need to compute the gradient of the function

$$\begin{aligned} D_{CS}(V) &= D_{CS}([V^T X_+], [V^T X_-]) \\ &= \log \int [V^T X_+]^2 + \log \int [V^T X_-]^2 - 2 \log \int [V^T X_+][V^T X_-], \end{aligned}$$

where we consider the space consisting only of linearly independent vectors. Since as the base of the space $\mathcal{V}$ we can always take orthonormal vectors, the maximum is realized for an orthonormal sequence, and therefore we can add a penalty term for being a non-orthonormal sequence, which helps to avoid numerical instabilities:

$$D_{CS}(V) - \|V^T V - I\|^2,$$

where, as we recall, the sequence $V$ is orthonormal iff $V^T V = I$. We call the above augmented $D_{CS}$ the maximum entropy linear manifold objective function

$$\mathrm{MELM}(V) = D_{CS}(V) - \|V^T V - I\|^2. \tag{2}$$

Besides the value of $\mathrm{MELM}(\cdot)$ we need the formula for its gradient $\nabla \mathrm{MELM}(\cdot)$. For the second term we have

$$\nabla \|V^T V - I\|^2 = 4 V V^T V - 4 V.$$
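
The gradient of the penalty term can be verified with central finite differences. The sketch below assumes the Frobenius norm in the penalty; the helper names and the random test matrix are illustrative.

```python
import numpy as np

def penalty(V):
    """Squared Frobenius norm of V^T V - I."""
    k = V.shape[1]
    return np.sum((V.T @ V - np.eye(k)) ** 2)

def penalty_grad(V):
    """Analytic gradient 4 V V^T V - 4 V."""
    return 4.0 * V @ V.T @ V - 4.0 * V

def numeric_grad(f, V, eps=1e-6):
    """Central finite differences, one matrix entry at a time."""
    G = np.zeros_like(V)
    for idx in np.ndindex(*V.shape):
        E = np.zeros_like(V)
        E[idx] = eps
        G[idx] = (f(V + E) - f(V - E)) / (2.0 * eps)
    return G

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 2))
max_err = np.max(np.abs(penalty_grad(V) - numeric_grad(penalty, V)))
print(max_err)
```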

We now consider the first term. Let us first provide the formula for the computation of the product of kernel density estimations of two sets.

Assume that we are given a set $A \subset \mathcal{V}$ (in our case $A$ will be the projection of $X_\pm$ onto $\mathcal{V}$), where $\mathcal{V}$ is $k$-dimensional. Then the formula for the kernel density estimation with a Gaussian kernel is given by [15]:

$$[A] = \frac{1}{|A|} \sum_{a \in A} \mathcal{N}(a, \Sigma_A),$$

where $\Sigma_A = (h_A)^2 \mathrm{cov}_A$ and (for $\gamma$ being a scaling hyperparameter) $h_A = \gamma \left(\frac{4}{k+2}\right)^{1/(k+4)} |A|^{-1/(k+4)}$.

5 And has the advantage of having a smaller number of parameters.

Now we need the formula for $\int [A][B]$, which is calculated [6] with the use of

$$\int \mathcal{N}(a, \Sigma_A)\,\mathcal{N}(b, \Sigma_B) = \mathcal{N}(a - b, \Sigma_A + \Sigma_B)(0).$$
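
The identity for the integral of a product of two Gaussian densities is easy to confirm numerically in one dimension. The centres, variances and integration grid in the sketch below are arbitrary illustrative values.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of the one-dimensional normal distribution N(mean, var) at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

a, var_a = 0.7, 1.3      # illustrative centre and kernel variance
b, var_b = -0.4, 0.6

x = np.linspace(-30.0, 30.0, 600001)
dx = x[1] - x[0]
# Left-hand side: integral of the product, approximated by a Riemann sum.
lhs = np.sum(normal_pdf(x, a, var_a) * normal_pdf(x, b, var_b)) * dx
# Right-hand side: N(a - b, var_a + var_b) evaluated at 0.
rhs = normal_pdf(0.0, a - b, var_a + var_b)
print(lhs, rhs)
```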


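Then we get

$$\begin{aligned} \int [A][B] &= \frac{1}{|A||B|} \sum_{w \in A - B} \mathcal{N}(w, \Sigma_A + \Sigma_B)(0) \\ &= \frac{1}{(2\pi)^{k/2} \det^{1/2}(\Sigma_{AB}) |A||B|} \sum_{w \in A - B} \exp\left(-\frac{1}{2} \|w\|^2_{\Sigma_{AB}}\right), \end{aligned}$$

where $\|w\|^2_{\Sigma} = w^T \Sigma^{-1} w$, $A - B = \{a - b : a \in A, b \in B\}$ and $\Sigma_{AB}$ is defined by

$$\begin{aligned} \Sigma_{AB} &= (h_A)^2 \mathrm{cov}_A + (h_B)^2 \mathrm{cov}_B \\ &= \gamma^2 \left(\frac{4}{k+2}\right)^{2/(k+4)} \left(|A|^{-2/(k+4)} \mathrm{cov}_A + |B|^{-2/(k+4)} \mathrm{cov}_B\right). \end{aligned}$$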

For a sequence $V = [V_1, \dots, V_k] \in \mathbb{R}^{d \times k}$ of linearly independent vectors we put

$$\Sigma_{AB}(V) = V^T \Sigma_{AB} V \quad \text{and} \quad S_{AB}(V) = \Sigma_{AB}(V)^{-1}.$$

Observe that $\Sigma_{AB}(V)$ and $S_{AB}(V)$ are square symmetric matrices which represent the properties of the projection of the data onto the space spanned by $V$. We put

$$\phi_{AB}(V) = \frac{1}{(2\pi)^{k/2} \det^{1/2}(\Sigma_{AB}(V)) |A||B|},$$

thus

$$\nabla \phi_{AB}(V) = -\phi_{AB}(V) \cdot \Sigma_{AB} \cdot V \cdot S_{AB}(V).$$

Consequently, to compute the final formula we need the gradient of the function $V \mapsto \det(\Sigma_{AB}(V))$, which, as one can easily verify, is given by the formula

$$\nabla \det(\Sigma_{AB}(V)) = 2 \det(V^T \Sigma_{AB} V) \cdot \Sigma_{AB} V (V^T \Sigma_{AB} V)^{-1}. \tag{3}$$
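
Formula (3) can be verified in the same spirit as the penalty gradient. In the sketch below a random symmetric positive definite matrix stands in for $\Sigma_{AB}$ (an assumption made only for the check), and the analytic gradient is compared with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 2
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)     # symmetric positive definite stand-in for Sigma_AB
V = rng.normal(size=(d, k))     # linearly independent columns (almost surely)

def det_proj(V):
    """det(V^T Sigma V)."""
    return np.linalg.det(V.T @ Sigma @ V)

# Analytic gradient from (3).
S = np.linalg.inv(V.T @ Sigma @ V)
grad_analytic = 2.0 * det_proj(V) * Sigma @ V @ S

# Central finite differences, one matrix entry at a time.
eps = 1e-6
grad_numeric = np.zeros_like(V)
for idx in np.ndindex(*V.shape):
    E = np.zeros_like(V)
    E[idx] = eps
    grad_numeric[idx] = (det_proj(V + E) - det_proj(V - E)) / (2.0 * eps)

rel_err = (np.max(np.abs(grad_analytic - grad_numeric))
           / np.max(np.abs(grad_analytic)))
print(rel_err)
```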


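One can also easily check that for

$$\psi^w_{AB}(V) = \exp\left(-\frac{1}{2} \|V^T w\|^2_{\Sigma_{AB}(V)}\right),$$

where $w$ is arbitrarily fixed, we get

$$\nabla \psi^w_{AB}(V) = -\psi^w_{AB}(V) \cdot \left(w w^T V S_{AB}(V) - \Sigma_{AB} V S_{AB}(V) V^T w w^T V S_{AB}(V)\right).$$

To present the final form of the gradient of $D_{CS}(V)$ we need the gradient of the cross information potential

$$ip^\times_{AB}(V) = \phi_{AB}(V) \sum_{w \in A - B} \psi^w_{AB}(V),$$

$$\nabla ip^\times_{AB}(V) = \phi_{AB}(V) \sum_{w \in A - B} \nabla \psi^w_{AB}(V) + \left(\sum_{w \in A - B} \psi^w_{AB}(V)\right) \cdot \nabla \phi_{AB}(V).$$

Since

$$D_{CS}(V) = \log(ip^\times_{X_+ X_+}(V)) + \log(ip^\times_{X_- X_-}(V)) - 2 \log(ip^\times_{X_+ X_-}(V)),$$

we finally get

$$\nabla D_{CS}(V) = \frac{1}{ip^\times_{X_+ X_+}(V)} \nabla ip^\times_{X_+ X_+}(V) + \frac{1}{ip^\times_{X_- X_-}(V)} \nabla ip^\times_{X_- X_-}(V) - \frac{2}{ip^\times_{X_+ X_-}(V)} \nabla ip^\times_{X_+ X_-}(V).$$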

Given

$$\mathrm{MELM}(V) = D_{CS}(V) - \|V^T V - I\|^2,$$

$$\nabla \mathrm{MELM}(V) = \nabla D_{CS}(V) - (4 V V^T V - 4 V),$$

one can run any first-order optimization method to find vectors $V$ spanning a $k$-dimensional subspace $\mathcal{V}$ representing a low-dimensional, discriminative manifold of the input space.
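
A minimal sketch of such an optimization is given below. It is not the authors' implementation: the data are synthetic, points are stored as rows (so the projection is written `X @ V` rather than $V^T X$), the helper names `cross_ip` and `melm` are assumptions, and the gradient is approximated by the finite differences built into `scipy.optimize.minimize` instead of the closed-form expression derived above.

```python
import numpy as np
from scipy.optimize import minimize

def cross_ip(A, B, gamma=1.0):
    """Cross information potential of Gaussian KDEs of A and B (rows are points)."""
    k = A.shape[1]
    c = gamma ** 2 * (4.0 / (k + 2)) ** (2.0 / (k + 4))     # Silverman-type factor
    S = c * (len(A) ** (-2.0 / (k + 4)) * np.atleast_2d(np.cov(A.T))
             + len(B) ** (-2.0 / (k + 4)) * np.atleast_2d(np.cov(B.T)))
    W = (A[:, None, :] - B[None, :, :]).reshape(-1, k)      # all differences a - b
    maha = np.einsum("ij,jk,ik->i", W, np.linalg.inv(S), W)
    norm = (2.0 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(S)) * len(A) * len(B)
    return np.exp(-0.5 * maha).sum() / norm

def melm(V, X_pos, X_neg):
    """MELM(V) = D_CS(V) - ||V^T V - I||^2 for data given row-wise."""
    P, N = X_pos @ V, X_neg @ V
    d_cs = (np.log(cross_ip(P, P)) + np.log(cross_ip(N, N))
            - 2.0 * np.log(cross_ip(P, N)))
    k = V.shape[1]
    return d_cs - np.sum((V.T @ V - np.eye(k)) ** 2)

# Synthetic two-class data in R^4 (placeholder for a real dataset).
rng = np.random.default_rng(0)
d, k = 4, 2
X_pos = rng.normal(size=(30, d)) + np.array([1.0, 0.0, 0.0, 0.0])
X_neg = rng.normal(size=(30, d))

V0, _ = np.linalg.qr(rng.normal(size=(d, k)))    # random orthonormal start

def neg_objective(v):
    """Negated MELM, since scipy minimizes."""
    return -melm(v.reshape(d, k), X_pos, X_neg)

res = minimize(neg_objective, V0.ravel(), method="L-BFGS-B")
V_opt = res.x.reshape(d, k)
print(melm(V0, X_pos, X_neg), melm(V_opt, X_pos, X_neg))
```

For larger problems one would pass the analytic gradient through the `jac` argument, which avoids the extra objective evaluations needed by finite differences.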


5 Experiments 

We use ten binary classification datasets from the UCI repository [1] and the libSVM repository [3], which are briefly summarized in Table 1. These are moderate size problems.

The code was written in Python with the use of scikit-learn, numpy [18] and scipy. Besides MELM we use 8 other linear dimensionality reduction techniques, namely: Principal Component Analysis (PCA), class PCA (cPCA⁶), two ellipsoid PCA (2ePCA⁷), per class PCA (pPCA⁸), Independent Component Analysis (ICA), Factor Analysis (FA), Nonnegative Matrix Factorization (NMF⁹) and Discriminative Learning using Generalized Eigenvectors (GEM [10]). PCA, ICA, NMF and FA are implemented in scikit-learn, cPCA, pPCA and 2ePCA were coded by the authors, and for GEM we use publicly available code¹⁰. An implementation of MELM as a model compatible with scikit-learn classifiers and transformers is available both in the supplementary materials and online¹¹.

In order to estimate how hard the MELM objective function is to optimize, we plot in Fig. 3 histograms of $D_{CS}$ values obtained during 500 random starts for each of the datasets. First, one can easily notice that $D_{CS}$ has multiple local extrema (see for example the heart or liver-disorders histograms). It also appears that for some of the considered datasets it is not easy to obtain the maximum from a completely random starting point (see the ionosphere and australian datasets), which suggests that one should probably use some more advanced initialization techniques.

6 cPCA uses the sum of the covariances of each class, weighted by class sizes, instead of the covariance of the whole data.

7 2ePCA is cPCA without weights, so it is a balanced counterpart.

8 pPCA uses as $V_i$ the first principal component of the $i$th class.

9 In order to use NMF we first transform the dataset so that it does not contain negative values.

10 forked at http://gist.github.com/lejlot/3ab46c7a249d4f375536

11 http://github.com/gmum/melm

dataset           $N$    $d$   $|X_-|$  $|X_+|$  $\bar{m}$  $d^{.95}$  $d_-^{.95}$  $d_+^{.95}$
australian        690    14    383      307      0.80       1          2            1
breast-cancer     683    10    444      239      1.00       1          1            1
diabetes          768    8     268      500      0.88       2          2            3
fourclass         862    2     555      307      1.00       2          2            2
german.numer      1000   24    700      300      0.75       3          3            3
heart             270    13    150      120      0.75       3          3            3
ionosphere        351    34    126      225      0.88       24         26           7
liver-disorders   345    6     145      200      1.00       3          3            3
sonar             208    60    111      97       1.00       28         24           24
splice            1000   60    483      517      1.00       55         52           54

Table 1: Summary of the datasets used. $N$ denotes the number of points, $d$ the dimensionality, $|X_l|$ the number of samples with label $l$, $\bar{m}$ the mean density (fraction of nonzero elements), and $d_l^{t}$ the number of dimensions which we have to include during PCA to keep a fraction $t$ of the variance of label $l$ (without a subscript, of the whole data).


To further investigate how hard it is to find a good solution when selecting the maximum of $D_{CS}$, we estimate the expected value of $D_{CS}$ after $s$ random starts from matrices $V^{(1)}, \dots, V^{(s)}$:

$$\mathbb{E}\left[\max_{i = 1, \dots, s} D_{CS}(\text{L-BFGS}(\mathrm{MELM} \mid V^{(i)}))\right].$$

As one can see in Fig. 4, for 8 out of 10 considered datasets one can expect to find the maximum (with 5% error) after just 16 random starts. Obviously this cannot be used as a general heuristic, as it depends heavily on the dataset size, its dimensionality and its discriminativeness. However, this experiment shows that for moderate size problems (hundreds to thousands of samples with dozens of dimensions) MELM can be relatively easily optimized even though it is a rather complex function with possibly many local maxima.
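
One simple way to estimate such an expectation from a pool of recorded optimization results is sketched below. It assumes that the $s$ starts are drawn with replacement from the empirical distribution of the recorded values, which may differ from the estimator used for Fig. 4. The function name `expected_max` and the placeholder values are illustrative and are not results from the paper.

```python
import numpy as np

def expected_max(values, s):
    """Expected maximum of s draws (with replacement) from the empirical distribution."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    i = np.arange(1, n + 1)
    # Probability that the maximum of s draws is the i-th order statistic.
    weights = (i / n) ** s - ((i - 1) / n) ** s
    return float(np.sum(weights * v))

# Hypothetical D_CS values reached from 500 random starts (placeholder data).
rng = np.random.default_rng(0)
values = rng.choice([0.4, 0.9, 1.0], size=500, p=[0.6, 0.3, 0.1])
for s in (1, 4, 16):
    est = expected_max(values, s)
    print(s, est, est / values.max())     # estimate and its ratio to the best value
```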

It is worth noting that a truly complex optimization problem is posed only by the ionosphere dataset. One can refer to Table 1 to see that this is a very specific problem where the positive class is located in a very low-dimensional linear manifold (approximately 7-dimensional) while the negative class is scattered over nearly 4 times more dimensions.

Figure 3: Histograms of $D_{CS}$ values obtained for each dataset during 500 random starts using L-BFGS.

Figure 4: Expected value of the Cauchy-Schwarz Divergence after MELM optimization for $s$ random starts using the L-BFGS algorithm (on the left) and its ratio to the maximum obtainable Cauchy-Schwarz Divergence (on the right). The dotted black line shows the 16 starts threshold.

We check how well MELM behaves when used in a classification pipeline. There are two main reasons for such an approach. First, if the discriminative manifold is low-dimensional, searching for it may boost the classification accuracy. Second, even if it decreases the classification score as compared to non-linear methods applied directly in the input space, the resulting model will be much simpler and more robust. For comparison consider training an RBF SVM in $\mathbb{R}^{60}$ using 1000 data points. It is a common situation that the SVM selects a large part of the dataset as support vectors, meaning that the classification of a new point requires roughly $500 \cdot 60 = 30000$ operations. At the same time, if we first embed the space in a plane and fit an RBF SVM there, we will build a model with far fewer support vectors (as the 2D decision boundary is generally not as complex as a 60-dimensional one), let us say 100, and consequently we will need $60 \cdot 2 + 2 \cdot 100 = 120 + 200 = 320$ operations, about two orders of magnitude fewer.
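
The operation counts above follow from a simple cost model, which is spelled out below. The numbers of support vectors are the rough figures assumed in the text, not measurements, and the variable names are illustrative.

```python
d, n_sv_full = 60, 500      # input dimension, assumed support vectors in the input space
k, n_sv_proj = 2, 100       # plane embedding, assumed support vectors after projection

ops_full = n_sv_full * d                 # one d-dimensional kernel evaluation per support vector
ops_proj = d * k + k * n_sv_proj         # projection V^T x, then kernel evaluations in the plane
print(ops_full, ops_proj, ops_full / ops_proj)
```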
The whole pipeline is composed of:

1. Splitting the dataset into training $X_-, X_+$ and testing $\hat{X}_-, \hat{X}_+$.

2. Finding the plane embedding matrix $V \in \mathbb{R}^{d \times 2}$ using the tested method.

3. Training a classifier $cl$ on $V^T X_-, V^T X_+$.

4. Testing $cl$ on $V^T \hat{X}_-, V^T \hat{X}_+$.
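
The four steps above can be sketched as follows. This is only an illustration under several assumptions: the data are synthetic, PCA (computed with an SVD) stands in for the tested embedding method, a hand-written k-nearest-neighbours vote stands in for the tuned classifiers, and a single train/test split replaces cross validation. Points are stored as rows, so the projection is written `X @ V`.

```python
import numpy as np

def pca_plane(X_train):
    """Stand-in for the tested embedding method: first two principal directions (d x 2)."""
    Xc = X_train - X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:2].T

def knn_predict(P_train, y_train, P_test, n_neighbors=5):
    """Plain k-nearest-neighbours vote with labels in {-1, +1}."""
    d2 = ((P_test[:, None, :] - P_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :n_neighbors]
    return np.where(y_train[nearest].sum(axis=1) >= 0, 1, -1)

def balanced_accuracy(y_true, y_pred):
    """BAC: mean of the true positive rate and the true negative rate."""
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == -1] == -1)
    return 0.5 * (tpr + tnr)

# Synthetic two-class data (placeholder for a real dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 10)) + 1.0,
               rng.normal(size=(100, 10)) - 1.0])
y = np.array([1] * 100 + [-1] * 100)

perm = rng.permutation(len(y))                    # step 1: train/test split
train, test = perm[:150], perm[150:]
V = pca_plane(X[train])                           # step 2: plane embedding matrix V
P_train, P_test = X[train] @ V, X[test] @ V       # projected training and testing points
y_pred = knn_predict(P_train, y[train], P_test)   # steps 3 and 4: fit on train, predict on test
bac = balanced_accuracy(y[test], y_pred)
print(bac)
```

Note that the embedding is fitted on the training part only, so the test points do not influence the choice of $V$.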

Table 2 summarizes BAC scores obtained by each method on each of the considered datasets in 5-fold cross validation. For the classifier module we used SVM RBF, KNN and KDE-based density classification. Each of them was fitted using internal cross-validation to find the best parameters. The GEM and MELM $\gamma$ hyperparameters were also fitted. Reported results come from the best classifier.

In four datasets, the MELM-based embedding led to the construction of a better classifier than both the other dimensionality reduction techniques and models trained on raw data. This suggests that for these datasets the discriminative manifold is truly at most 2-dimensional. At the same time, in nearly all datasets (except sonar) the pipeline using MELM yielded significantly better classification results than any other embedding considered.

One of the main applications of MELM is to visualize the dataset through a linear projection in such a way that classes do not overlap. One can see comparisons of heart dataset projections using all considered approaches in Fig. 5. As one can notice, our method finds a plane projection where classes are nearly perfectly discriminated. Interestingly, this separation is only obtainable in two dimensions, as neither marginal distributions nor any other one-dimensional projection can construct such a separation.

While visual inspection is crucial for such tasks, to truly compare competing methods we need some metric to measure the quality of the visualization. In order to do so, we propose to assign a visual separability score as the mean BAC score over the three considered classifiers (SVM RBF, KNN, KDE) trained and tested in 5-fold cross validation on the projected data. The only difference between this test and the previous one is that we use the whole data to find a projection (so each projection technique uses all datapoints) and only the subsequent visualization testing is performed using train-test splits. This way we can capture "how easy to discriminate are points in this projection" rather than "how useful for data discrimination is using the projection". Experiments are repeated using various random subsets of samples and mean results are reported.

During these experiments MELM achieved considerably better scores than any other tested method (see Table 3). Solutions were about 10% better under our metric and this difference is consistent over all considered datasets. In other words MELM finds two-dimensional representations of our data using just a linear projection where classes overlap to a significantly smaller degree than when using PCA, cPCA, 2ePCA, pPCA, ICA, NMF, FA or GEM. It is also worth noting that Factor Analysis, as the only method which does not require orthogonality of the resulting projection vectors, performed very poorly on the fourclass data even though these samples are just two-dimensional.


Figure 5: Comparison of heart dataset 2D projections by the analyzed methods. Visualization uses kernel density estimation.


6 Conclusions

In this paper we construct Maximum Entropy Linear Manifold (MELM), a method of learning a discriminative low-dimensional representation which can be used both for classification purposes and for a visualization preserving the separation of classes. The proposed model has important theoretical properties including affine transformation invariance, connections with PCA and bounding of the expected balanced misclassification error. During evaluation we show that for moderate size problems MELM can be efficiently optimized using simple first-order optimization techniques. The obtained results confirm that such an approach leads to a highly discriminative transformation, better than those obtained by the 8 compared solutions.